Abstract

Modern statistical machine translation (SMT) systems usually use a linear combination of features to model the quality of each translation hypothesis. The linear combination assumes that all the features are in a linear relationship and constrains each feature to interact with the other features in a linear manner, which might limit the expressive power of the model and lead to an under-fit model on the current data. In this paper, we propose a non-linear modeling of the quality of translation hypotheses based on neural networks, which allows more complex interaction between features. A learning framework is presented for training the non-linear models. We also discuss possible heuristics in designing the network structure which may improve the non-linear learning performance. Experimental results show that with the basic features of a hierarchical phrase-based machine translation system, our method produces translations that are better than those of a linear model.

1 Introduction 

One of the core problems in the research of statistical machine translation is the modeling of translation hypotheses. Each modeling method defines a score of a target sentence $e = e_1, e_2, \ldots, e_i, \ldots, e_I$, given a source sentence $f = f_1, f_2, \ldots, f_j, \ldots, f_J$, where each $e_i$ is the $i$th target word and $f_j$ is the $j$th source word. The well-known modeling method starts from the Source-Channel model (Brown et al., 1993) (Equation 1). The scoring of $e$ decomposes into the calculation of a translation model and a language model.

$$\Pr(e|f) = \Pr(e)\Pr(f|e)/\Pr(f) \qquad (1)$$


The modeling method is extended to log-linear models by Och and Ney (2002), as shown in Equation 2, where $h_m(e|f)$ is the $m$th feature function and $\lambda_m$ is the corresponding weight.

$$\Pr(e|f) = p_{\lambda_1^M}(e|f) = \frac{\exp\left[\sum_{m=1}^{M} \lambda_m h_m(e|f)\right]}{\sum_{e'} \exp\left[\sum_{m=1}^{M} \lambda_m h_m(e'|f)\right]} \qquad (2)$$

Because the normalization term in Equation 2 is the same for all translation hypotheses of the same source sentence, the score of each hypothesis, denoted by $s_L$, is actually a linear combination of all features, as shown in Equation 3.

$$s_L(e) = \sum_{m=1}^{M} \lambda_m h_m(e|f) \qquad (3)$$
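The step from Equation 2 to Equation 3 can be checked numerically. The following is a minimal sketch, assuming the candidate set is a small finite list; the weights and feature values are hypothetical and serve only to show that ranking by the normalized probability and ranking by the linear score pick the same hypothesis.

```python
import math

def linear_score(weights, features):
    """Equation 3: weighted sum of the feature values of one hypothesis."""
    return sum(w * h for w, h in zip(weights, features))

def log_linear_posterior(weights, candidates):
    """Equation 2, restricted to a finite candidate list (an assumption of this sketch)."""
    scores = [linear_score(weights, h) for h in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)      # normalization term, shared by all candidates
    return [x / z for x in exps]

# Hypothetical weights and feature vectors for three hypotheses.
weights = [0.5, -1.0, 2.0]
candidates = [[1.0, 0.2, 0.1], [0.3, 0.1, 0.4], [2.0, 1.5, 0.0]]

scores = [linear_score(weights, h) for h in candidates]
probs = log_linear_posterior(weights, candidates)
best_by_score = max(range(len(scores)), key=scores.__getitem__)
best_by_prob = max(range(len(probs)), key=probs.__getitem__)
```

Because the exponential is monotonic and the normalizer is shared, `best_by_score` and `best_by_prob` always agree, which is why decoding only needs the linear score.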


The log-linear models are flexible enough to incorporate new features and show significant advantages over the traditional source-channel models. They have thus become the state-of-the-art modeling method and are applied in various translation settings (Yamada and Knight, 2001; Koehn et al., 2003; Chiang, 2005; Liu et al., 2006).


It is worth noticing that log-linear models try to separate good and bad translation hypotheses using a linear hyper-plane. However, complex interactions between features make it difficult to linearly separate good translation hypotheses from bad ones (Clark et al., 2014).

Taking the features in a typical phrase-based machine translation system (Koehn et al., 2003) as an example, the language model feature favors shorter hypotheses, while the word penalty feature encourages longer hypotheses. The phrase translation probability feature selects phrases that occur more frequently in the training corpus, which are sometimes long with lower translation probability, as in translating named entities or idioms, and sometimes short with high translation probability, as in translating verbs or pronouns. These three features jointly decide the choice of translations. Simply using the weighted sum of their values may not be the best choice for modeling translations.

As a result, log-linear models may under-fit the data. This under-fitting may prevent further improvement of translation quality.

In this paper, we propose a non-linear modeling of translation hypotheses based on neural networks. The traditional features of a machine translation system are used as the input to the network. By feeding input features to nodes in a hidden layer, complex interactions among features are modeled, resulting in much stronger expressive power than traditional log-linear models (Section 3).


Employing a neural network as a non-linear model for SMT raises two issues. The first issue is parameter learning. Log-linear models rely on minimum error rate training (MERT) (Och, 2003) to achieve the best performance. When the scoring function becomes non-linear, the intersection points of these non-linear functions cannot be effectively calculated and enumerated. Thus MERT is no longer suitable for learning the parameters. To solve this problem, we present a framework for effective training, including several criteria to transform the training problem into a binary classification task, a unified objective function and an iterative training algorithm (Section 4).


The second issue is the structure of the neural network. Single-layer neural networks are equivalent to linear models; two-layer networks with sufficient nodes are capable of learning any continuous function (Bishop, 1995). Adding more layers to the network could model complex functions with fewer nodes, but also brings the problem of vanishing gradients (Erhan et al., 2009). We adopt a two-layer feed-forward neural network to keep the training process efficient. We notice that one major problem that prevents neural network training from reaching a good solution is that there are too many local minima in the parameter space. Thus we discuss how to constrain the learning of neural networks with our intuition and observations of the features (Section 5).


Experiments are conducted to compare various settings and verify the effectiveness of our proposed learning framework. Experimental results show that our framework can achieve better translation quality even with the same traditional features as previous linear models (Section 6).


2 Related work 


Many studies have attempted to bring non-linearity into the training of SMT. These efforts can be roughly divided into the following three categories.

The first line of research attempted to re-interpret original features via feature transformation or additional learning. For example, Maskey and Zhou (2012) used a deep belief network to learn representations of the phrase translation and lexical translation probability features. Clark et al. (2014) used discretization to transform real-valued dense features into a set of binary indicator features. Lu et al. (2014) learned new features using a semi-supervised deep auto-encoder. These works focus on the explicit representation of the features and usually employ an extra learning procedure. Our proposed method only takes the original features, with no transformation, as input. Feature transformation or combination is performed implicitly during the training of the network and integrated with the optimization of translation quality.

The second line of research attempted to use non-linear models instead of log-linear models, which is most similar in spirit to our work. Duh and Kirchhoff (2008) used the boosting method to combine several results of MERT and achieved improvement in a re-ranking setting. Liu et al. (2013) proposed an additive neural network which employed a two-layer neural network for embedding-based features. To avoid local minima, they still rely on pre-training and post-training from MERT or PRO. Compared to these efforts, our proposed method takes a further step in that it is integrated with iterative training, instead of re-ranking, and works without the help of any pre-trained linear models.

The third line of research attempted to add non-linear features or components into the log-linear learning framework. Neural network based models are trained as language models (Vaswani et al., 2013; Auli and Gao, 2014), translation models (Gao et al., 2014) or joint language and translation models (Auli et al., 2013; Devlin et al., 2014). Liu et al. (2013) also introduced word embeddings for the source and target sides of translation rules as local features. In this paper we focus on enhancing the expressive power of the modeling, which is independent of the research on enhancing translation systems with newly designed features. We believe additional improvement could be achieved by incorporating more features into our framework.



3 Non-linear Translation 


The non-linear modeling of translation hypotheses could be used in both phrase-based systems and syntax-based systems. In this paper, we take the hierarchical phrase-based machine translation system (Chiang, 2005) as an example and introduce how we fit the non-linearity into the system.

3.1 Decoding 

The basic decoding algorithm can be kept almost the same as in traditional phrase-based or syntax-based translation systems (Yamada and Knight, 2001; Koehn et al., 2003; Chiang, 2005; Liu et al., 2006). For example, in the experiments of this paper, we use a CKY-style decoding algorithm following Chiang (2005).

Our non-linear translation system differs from traditional systems in the way it calculates the score for each hypothesis. Instead of calculating the score as a linear combination, we use neural networks (Section 3.2) to perform a non-linear combination of feature values.

We also use the cube-pruning algorithm (Chiang, 2005) to keep the decoding efficient. Although the non-linearity in model scores may cause more search errors in finding the highest scoring hypothesis, in practice it still achieves reasonable results.


3.2 Two-layer Neural Networks 

We employ a two-layer neural network as the non-linear model for scoring translation hypotheses. The structure of a typical two-layer feed-forward neural network includes an input layer, a hidden layer, and an output layer (as shown in Figure 1).

Figure 1: A two-layer feed-forward neural network.

We use the input layer to accept input features, the hidden layer to combine different input features, and the output layer, with only one node, to output the model score for each translation hypothesis based on the values of the hidden nodes. More specifically, the score of hypothesis $e$, denoted as $s_N$, is defined as:

$$s_N(e) = \sigma_o(M_o \cdot \sigma_h(M_h \cdot h_1^M(e|f) + b_h) + b_o) \qquad (4)$$


where $M$ and $b$ are the weight matrix and bias vector of the neural nodes, respectively; $\sigma$ is the activation function, which is often set to a non-linear function such as the tanh function or the sigmoid function; the subscripts $h$ and $o$ indicate the parameters of the hidden layer and the output layer, respectively.
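The scoring function of Equation 4 can be illustrated with a small forward pass. This is a minimal sketch in plain Python, assuming a sigmoid hidden activation and a linear output activation (the configuration reported in Section 6.1); the weights, biases and feature values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def two_layer_score(features, M_h, b_h, M_o, b_o,
                    act_h=sigmoid, act_o=lambda x: x):
    """Score one hypothesis following the shape of Equation 4.

    M_h holds one weight row per hidden node, b_h one bias per hidden node,
    M_o one weight per hidden node, and b_o is the scalar output bias.
    """
    hidden = [act_h(sum(w * h for w, h in zip(row, features)) + b)
              for row, b in zip(M_h, b_h)]
    return act_o(sum(w * a for w, a in zip(M_o, hidden)) + b_o)

# Hypothetical network with two input features and two hidden nodes.
features = [1.0, 2.0]
M_h = [[0.0, 0.0], [1.0, -0.5]]  # both hidden pre-activations are 0 here
b_h = [0.0, 0.0]
M_o = [2.0, 4.0]
b_o = -1.0
score = two_layer_score(features, M_h, b_h, M_o, b_o)
```

With these values both hidden nodes output 0.5, so the score is 2 * 0.5 + 4 * 0.5 - 1 = 2.0. Replacing the hidden activation with the identity collapses the network back to a linear model of the features.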


3.3 Features 

We use the standard features of a typical hierarchical phrase-based translation system (Chiang, 2005). Adding new features into the framework is left as a future direction. The features are listed as follows:


• $p(\alpha|\gamma)$ and $p(\gamma|\alpha)$: conditional probability of translating $\alpha$ as $\gamma$ and translating $\gamma$ as $\alpha$, where $\alpha$ and $\gamma$ are the left- and right-hand sides of an initial phrase (or hierarchical translation rule), respectively;

• $p_w(\alpha|\gamma)$ and $p_w(\gamma|\alpha)$: lexical probability of translating words in $\alpha$ as words in $\gamma$ and translating words in $\gamma$ as words in $\alpha$;

• $p_{lm}$: language model probability;

• wc: accumulated count of individual words generated during translation;

• pc: accumulated count of initial phrases used;

• rc: accumulated count of hierarchical rule phrases used;

• gc: accumulated count of glue rules used in this hypothesis;

• uc: accumulated count of unknown source words;

• nc: accumulated count of source phrases that translate into null.

4 Non-linear Learning Framework 


Traditional machine translation systems rely on MERT to tune the weights of different features. MERT performs an efficient search by enumerating the score functions of all the hypotheses and using the intersections of these linear functions to form the "upper envelope" of the model score function (Och, 2003). When the scoring function is non-linear, it is not feasible to find the intersections of these functions. In this section, we discuss alternatives for training the parameters of non-linear models.

4.1 Training Criteria

The task of machine translation is a complex problem with a structured output space. Decoding algorithms search for the translation hypothesis with the highest score, according to a given scoring function, from an exponentially large set of candidate hypotheses. The purpose of training is to select the scoring function so that it scores the hypotheses "correctly". The correctness is often defined by some extrinsic metric, such as BLEU (Papineni et al., 2002).

We denote the scoring function as $s(f, e; \theta)$, or simply $s$, which is parametrized by $\theta$; denote the set of all candidate hypotheses as $C$; denote the extrinsic metric as $eval(\cdot)$. Note that, in the linear case, $s$ is a linear function as in Equation 3, while in the non-linear case described in this paper, $s$ is the scoring function in Equation 4.

Ideally, the training objective is to select a scoring function $\hat{s}$, from all functions $S$, that scores the correct translation (or references), denoted as $\hat{e}$, higher than any other hypothesis (Equation 5).

$$\hat{s} = \{s \in S \mid s(\hat{e}) > s(e) \;\; \forall e \in C\} \qquad (5)$$

In practice, the candidate set $C$ is exponentially large and hard to enumerate; the correct translation $\hat{e}$ may not even exist in the current search space for various reasons, e.g. unknown source words. As a result, we seek the following three alternatives as approximations to the ideal objective.

Best v.s. Rest (BR) To score the best hypothesis in the n-best set, $\tilde{e}$, higher than the rest of the hypotheses. This objective is very similar to MERT in that it tries to optimize the score of $\tilde{e}$ and is not concerned with the ranking of the remaining hypotheses. In this case, the n-best set $C_{nbest}$ is used to approximate $C$, and $\tilde{e}$ to approximate $\hat{e}$.

Best v.s. Worst (BW) To score the best hypothesis higher than the worst hypothesis in the n-best set. This objective is motivated by the practice of separating the "hope" and "fear" translation hypotheses (Chiang, 2012). We take a simpler strategy which uses the best and worst hypotheses in $C_{nbest}$ as the "hope" and "fear" hypotheses, respectively, in order to avoid multi-pass decoding.

Pairwise (PW) To score the better hypothesis in each sampled hypothesis pair higher than the worse one in the same pair. This objective is adapted from Pairwise Ranking Optimization (PRO) (Hopkins and May, 2011), which tries to rank all the hypotheses instead of selecting the best one. We use the same sampling strategy as the original paper.


Note that each of the above criteria transforms the original problem of selecting the best hypotheses from an exponential space into a pair-wise comparison problem, which can be easily trained as a standard binary classifier.


4.2 Training Objective 

For the binary classification task, we use a hinge loss following Watanabe (2012). Because the network has a lot of parameters compared with the linear model, we use an $L_1$ norm instead of an $L_2$ norm as the regularization term, to favor sparse solutions. We define our training objective function in Equation 6.

$$\arg\min_{\theta} \frac{1}{N} \sum_{f \in D} \sum_{(e_1, e_2) \in T} \delta(f, e_1, e_2; \theta) + \lambda \cdot \|\theta\|_1$$

with

$$\delta(\cdot) = \max\{s(f, e_1; \theta) - s(f, e_2; \theta) + 1, 0\} \qquad (6)$$


$D$ is the given training data; $(e_1, e_2)$ is a training hypothesis pair, with the assumption that $e_1$ is the one with the higher $eval(\cdot)$ score; $N$ is the total number of hypothesis pairs in $D$; $T$ is the set of hypothesis pairs for each source sentence.
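The objective can be illustrated with a few lines of Python. This is a minimal sketch, not the authors' implementation: it uses the usual ranking form of the hinge, which is zero once the better hypothesis $e_1$ outscores $e_2$ by a margin of 1. That orientation is an assumption of the sketch and should be checked against the sign convention of Equation 6 before reuse. The toy linear scorer, the parameter values and the pairs are hypothetical.

```python
def hinge_pair_loss(score_better, score_worse, margin=1.0):
    """Ranking hinge: zero once the better hypothesis wins by `margin` (assumed orientation)."""
    return max(margin - (score_better - score_worse), 0.0)

def objective(pairs, theta, score_fn, lam):
    """Mean pairwise hinge loss plus an L1 penalty on the parameters.

    pairs: list of (f, e1, e2) where e1 has the higher eval(.) score.
    """
    data_term = sum(
        hinge_pair_loss(score_fn(f, e1, theta), score_fn(f, e2, theta))
        for f, e1, e2 in pairs
    ) / len(pairs)
    l1_term = sum(abs(t) for t in theta)
    return data_term + lam * l1_term

def toy_score(f, e, theta):
    """Hypothetical scorer: hypotheses are feature vectors, scored linearly."""
    return sum(t * x for t, x in zip(theta, e))

theta = [1.0, -2.0]
pairs = [
    (None, [2.0, 0.0], [0.0, 0.0]),  # margin satisfied, loss 0
    (None, [0.5, 0.0], [0.0, 0.0]),  # margin violated, loss 0.5
]
value = objective(pairs, theta, toy_score, lam=0.1)
```

Here the data term is 0.25 and the $L_1$ term is 3.0, giving 0.25 + 0.1 * 3.0 = 0.55. The absolute-value penalty is what pushes individual weights to exactly zero, which is the sparsity the text motivates.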

The set $T$ is decided by the criterion used for training. For the BR setting, the best hypothesis is paired with every other hypothesis in the n-best list (Equation 7), while for the BW setting, it is only paired with the worst hypothesis (Equation 8). The generation of $T$ in the PW setting is the same as in PRO sampling; we refer the readers to the original paper of Hopkins and May (2011).

$$T_{BR} = \{(e_1, e_2) \mid e_1 = \arg\max_{e \in C_{nbest}} eval(e),\; e_2 \in C_{nbest} \text{ and } e_1 \neq e_2\} \qquad (7)$$

$$T_{BW} = \{(e_1, e_2) \mid e_1 = \arg\max_{e \in C_{nbest}} eval(e),\; e_2 = \arg\min_{e \in C_{nbest}} eval(e)\} \qquad (8)$$
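Equations 7 and 8 translate directly into pair generation over an n-best list. The sketch below is a minimal illustration; the hypothesis identifiers and the `eval_score` table are hypothetical stand-ins for decoder output and sentence-level metric scores.

```python
def best_vs_rest(nbest, eval_fn):
    """Equation 7: pair the best hypothesis with every other one."""
    best = max(nbest, key=eval_fn)
    return [(best, e) for e in nbest if e != best]

def best_vs_worst(nbest, eval_fn):
    """Equation 8: pair the best hypothesis with the worst one."""
    best = max(nbest, key=eval_fn)
    worst = min(nbest, key=eval_fn)
    return [(best, worst)]

# Hypothetical 20-best list with made-up, distinct metric scores.
nbest = [f"hyp{i}" for i in range(20)]
eval_score = {h: ((i * 7) % 20) / 20.0 for i, h in enumerate(nbest)}

br_pairs = best_vs_rest(nbest, eval_score.get)
bw_pairs = best_vs_worst(nbest, eval_score.get)
```

With a 20-best list, BR yields 19 pairs per sentence and BW yields a single pair.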

4.3 Training Procedure 

In standard training algorithms for classification, the training instances stay the same in each iteration. In machine translation, decoding algorithms usually return a very different n-best set with different parameters. This is due to the exponentially large size of the search space. MERT and PRO extend the current n-best set by merging the n-best sets of all previous iterations into a pool (Papineni et al., 2002; Hopkins and May, 2011). In this way, the enlarged n-best set may give a better approximation of the true hypothesis set $C$ and may lead to better and more stable training results.

We argue that the training should still focus on hypotheses obtained in the current round, because in each iteration the search for the n-best set is independent of previous iterations. To balance the above two goals, in our practice, training hypothesis pairs are first generated from the current n-best set, then merged with the pairs generated from all previous iterations. In order to make the model focus more on pairs from the current iteration, we assign pairs from previous iterations a small constant weight and assign pairs from the current iteration a relatively large constant weight. This is inspired by the way the AdaBoost algorithm (Schapire, 1999) weights instances.
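The weighting scheme can be sketched as follows. The weight values `w_prev` and `w_cur` are hypothetical, since the text only says "small" and "relatively large", and folding the weights into a weighted mean of per-pair losses is an assumption of this sketch rather than a detail given here.

```python
def weighted_combine(previous_pairs, current_pairs, w_prev=0.1, w_cur=1.0):
    """Attach a constant weight to every pair (weight values are hypothetical)."""
    return ([(p, w_prev) for p in previous_pairs]
            + [(p, w_cur) for p in current_pairs])

def weighted_mean_loss(weighted_pairs, pair_loss):
    """Weighted mean of per-pair losses (assumed way of using the weights)."""
    total_weight = sum(w for _, w in weighted_pairs)
    return sum(w * pair_loss(p) for p, w in weighted_pairs) / total_weight

# Four pairs carried over from earlier iterations, one from the current one.
previous = [("old_best", "old_worst")] * 4
current = [("new_best", "new_worst")]
combined = weighted_combine(previous, current)

# Toy loss: only the current pair is still misranked.
loss = weighted_mean_loss(
    combined, lambda p: 1.0 if p[0].startswith("new") else 0.0
)
```

Although the current pair is only one of five, it contributes 1.0 / 1.4 of the weighted loss, which shows how the constant weights keep training focused on the current iteration.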

Following the spirit of MERT, we propose an iterative training procedure (Algorithm 1).

As shown in Algorithm 1, the training procedure starts by randomly initializing the model parameters $\theta^0$ (line 1). In the $i$th iteration, the decoding algorithm decodes each sentence $f$ to get the n-best set $C_{nbest}$ (line 5). Training hypothesis pairs $T$ are extracted from $C_{nbest}$ according to the training criterion described in Section 4.2 (line 6). Newly collected pairs $T_i$ are combined with pairs from previous iterations before being used for training (line 9). $\theta^{i+1}$ is obtained by solving Equation 6 using the Conjugate Sub-Gradient method (Le et al., 2011) (line 10).

Algorithm 1 Iterative Training Algorithm
Input: the set of training sentences $D$, max number of iterations $I$
1: $\theta^0 \leftarrow$ RandomInit()
2: for $i = 0$ to $I$ do
3:   $T_i \leftarrow \emptyset$
4:   for each $f \in D$ do
5:     $C_{nbest} \leftarrow$ NbestDecode($f; \theta^i$)
6:     $T \leftarrow$ GeneratePair($C_{nbest}$)
7:     $T_i \leftarrow T_i \cup T$
8:   end for
9:   $T_{all} \leftarrow$ WeightedCombine($\cup_{k=0}^{i-1} T_k$, $T_i$)
10:  $\theta^{i+1} \leftarrow$ Optimize($T_{all}$, $\theta^i$)
11: end for
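The control flow of Algorithm 1 can be sketched as a Python skeleton. This is only an outline under stated assumptions: the decoder, pair generator, combiner and optimizer are passed in as callables, and the stand-ins used in the usage example are hypothetical stubs, not a real decoder or the Conjugate Sub-Gradient solver.

```python
def iterative_training(sentences, max_iter, init, decode,
                       make_pairs, combine, optimize):
    """Skeleton of Algorithm 1; all callables are supplied by the caller."""
    theta = init()                                   # line 1
    previous = []                                    # pairs from earlier iterations
    for _ in range(max_iter + 1):                    # line 2: i = 0 .. I inclusive
        current = []                                 # line 3
        for f in sentences:                          # line 4
            nbest = decode(f, theta)                 # line 5
            current.extend(make_pairs(f, nbest))     # lines 6-7
        training_set = combine(previous, current)    # line 9
        theta = optimize(training_set, theta)        # line 10
        previous.extend(current)
    return theta

# Usage with hypothetical stubs that only record the training-set sizes.
sizes = []
final_theta = iterative_training(
    sentences=["f1", "f2"],
    max_iter=2,
    init=lambda: 0,
    decode=lambda f, theta: [f + "-a", f + "-b"],
    make_pairs=lambda f, nbest: [(nbest[0], nbest[1])],
    combine=lambda old, new: [(p, 0.1) for p in old] + [(p, 1.0) for p in new],
    optimize=lambda pairs, theta: sizes.append(len(pairs)) or theta + 1,
)
```

With two sentences and one pair per sentence, the training set grows from 2 to 4 to 6 pairs over the three iterations, reflecting the accumulation of pairs from earlier rounds.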

5 Structure of the Network 

Although neural networks bring strong expressive power to the modeling of translation hypotheses, training a neural network is prone to ending in a local minimum, which may affect the training results. We speculate that one reason for these local minima is that a fully connected network has too many parameters. Take a neural network with $k$ nodes in the input layer and $m$ nodes in the hidden layer as an example. Every node in the hidden layer is connected to each of the $k$ input nodes. This simple structure results in at least $k \times m$ parameters.

In Section 4.2, we use the $L_1$ norm in the objective function in order to get sparser solutions. In this section, we propose some constrained network structures according to our prior knowledge of the features. These structures have far fewer parameters or simpler structures compared to the original neural networks, and thus reduce the possibility of getting stuck in local minima.

5.1 Network with two-degree Hidden Layer 

We find that the first pitfall of the standard two-layer neural network is that each node in the hidden layer receives input from every input layer node. Features used in SMT are usually manually designed and have concrete meanings. For a network with several hidden nodes, combining every feature into every hidden node may be redundant and unnecessary for representing the quality of a hypothesis.

As a result, we take a harsh step and constrain the nodes in the hidden layer to have an in-degree of two, which means each hidden node only accepts inputs from two input nodes. We do not use any other prior knowledge about features in this setting. So for a network with $k$ nodes in the input layer, the hidden layer should contain $C_k^2 = k(k-1)/2$ nodes to accept all combinations from the input layer. We name this network structure the Two-Degree Hidden Layer Network (TDN).

It is easy to see that a TDN has $C_k^2 \times 2 = k(k-1)$ parameters for the hidden layer because of the constrained degree. This is a factor of $k/2$ fewer than a standard two-layer network with the same number of hidden nodes, which has $C_k^2 \times k = k^2(k-1)/2$ parameters.
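The parameter counts above can be verified by enumerating the hidden nodes. The sketch below is a minimal illustration that uses $k = 11$, the number of input nodes reported in Section 6.1; the function name is hypothetical.

```python
from itertools import combinations

def tdn_hidden_inputs(k):
    """One hidden node per unordered pair of input nodes (in-degree two)."""
    return list(combinations(range(k), 2))

k = 11
hidden_inputs = tdn_hidden_inputs(k)
n_hidden = len(hidden_inputs)        # k(k-1)/2
tdn_weights = 2 * n_hidden           # two incoming weights per hidden node
full_weights = k * n_hidden          # fully connected layer of the same width
```

For $k = 11$ this gives 55 hidden nodes, 110 hidden-layer weights for the TDN and 605 for a fully connected hidden layer of the same width, a ratio of $k/2 = 5.5$. Bias terms are not counted here.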

Note that the 2-degree combination we perform looks similar in spirit to the combination of atomic features in large-scale discriminative learning for other NLP tasks, such as POS tagging and parsing. However, unlike the practice in these tasks, which directly combines the values of different features to generate a new feature type, we first linearly combine the values of these features and then perform a non-linear transformation on the result via an activation function.

5.2 Network with Grouped Features 

Requiring every hidden node to have an in-degree of 2 might be too strong a constraint. In order to relax this constraint, we need more prior knowledge about the features. Our first observation is that there are different types of features. These types differ from each other in terms of value ranges, sources, importance, etc. For example, language model features usually take a very small probability value, while the word count feature takes an integer value and usually has a much higher weight in the linear case than other count features.

The second observation is that features in the same group are basically of the same type and may not have complex interactions with each other. For example, it is reasonable to combine language model features with word count features in a hidden node. But it may not be necessary to combine the count of initial phrases and the count of unknown words in a hidden node.

Based on the above two intuitions, we design a new network structure that has the following constraint: given a disjoint partition of features $G_1, G_2, \ldots, G_k$, every hidden node takes input from a set of input nodes, where any two nodes in this set come from two different feature groups. We name this network structure the Grouped Network (GN).

In practice, we divide the basic features in Section 3.3 into five groups: language model features, translation probability features, lexical probability features, the word count feature, and the rest of the count features.
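The grouping constraint can be expressed as a simple check on the inputs of a hidden node. The sketch below is hypothetical: the feature labels are shorthand for the features of Section 3.3, and enumerating hidden nodes that take exactly one feature from every group is just one possible way to satisfy the constraint, not necessarily the layout used in the experiments.

```python
from itertools import product

# Five groups over the 11 basic features (labels are shorthand).
GROUPS = {
    "lm": ["p_lm"],
    "translation": ["p(a|g)", "p(g|a)"],
    "lexical": ["p_w(a|g)", "p_w(g|a)"],
    "word_count": ["wc"],
    "other_counts": ["pc", "rc", "gc", "uc", "nc"],
}
GROUP_OF = {feat: g for g, feats in GROUPS.items() for feat in feats}

def satisfies_gn_constraint(inputs):
    """True if no two inputs of a hidden node share a feature group."""
    groups = [GROUP_OF[f] for f in inputs]
    return len(groups) == len(set(groups))

def one_per_group_nodes():
    """Assumed layout: each hidden node takes one feature from every group."""
    return [list(combo) for combo in product(*GROUPS.values())]

nodes = one_per_group_nodes()
```

Under this assumed layout there are 1 * 2 * 2 * 1 * 5 = 20 hidden nodes, each with in-degree 5. A node combining `pc` and `uc` would be rejected, matching the example in the text of two count features that need not interact.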

6 Experiments and Results 

6.1 General Settings 

We conduct experiments on a large-scale machine translation task. The parallel data comes from LDC, including LDC2002E18, LDC2003E14, LDC2004E12, LDC2004T08, LDC2005T10 and LDC2007T09, which consist of 8.6 million sentence pairs. Monolingual data includes the Xinhua portion of the Gigaword corpus. We use the multi-reference data MT03 as training data, MT02 as development data, and MT04 and MT05 as test data. These data are mainly in the same genre, avoiding the extra consideration of domain adaptation.


Data      Usage     Sents.
LDC       TM train  8,260,093
Gigaword  LM train  14,684,074
MT03      train     919
MT02      dev       878
MT04      test      1,789
MT05      test      1,083

Table 1: Experimental data and statistics.


The Chinese side of the corpora is word segmented using ICTCLAS^1. Our translation system is an in-house implementation of the hierarchical phrase-based translation system (Chiang, 2005). We set the beam size to 20. We train a 5-gram language model on the monolingual data with MKN smoothing (Chen and Goodman, 1998). For each parameter tuning experiment, we ran the same training procedure 3 times and present the average results. The translation quality is evaluated using 4-gram case-insensitive BLEU (Papineni et al., 2002). Significance testing is performed using bootstrap re-sampling as implemented by Clark et al. (2011). We employ a two-layer neural network with 11 input layer nodes, corresponding to the features listed in Section 3.3, and 1 output layer node. The number of nodes in the hidden layer varies in different settings. The sigmoid function is used as the activation function for each node in the hidden layer. For the output layer we use a linear activation function. We try different values of $\lambda$ for the $L_1$ norm, from 0.01 to 0.00001, and use the one with the best performance on the development set. We solve the optimization problem with the ALGLIB package^2.

1 http://ictclas.nlpir.org/

Criteria  MT03(train)  MT02(dev)  MT04   MT05
BR_c      35.02        36.63      34.96  34.15
BR        38.66        40.04      38.73  37.50
BW        39.55        39.36      38.72  37.81
PW        38.61        38.85      38.73  37.98

Table 2: BLEU4 in percentage for different training criteria ("BR", "BW" and "PW" refer to experiments with the "Best v.s. Rest", "Best v.s. Worst" and "Pairwise" training criteria, respectively. "BR_c" indicates generating hypothesis pairs from the n-best set of the current iteration only, as presented in Section 4.3).


6.2 Experiments of Training Criteria 


This set of experiments evaluates the different training criteria discussed in Section 4.1. We generate hypothesis pairs according to the BW, BR and PW criteria, respectively, and perform training with these pairs. In the PW criterion, we use the sampling method of PRO (Hopkins and May, 2011) and get 50 hypothesis pairs for each sentence. We use 20 hidden nodes for all three settings to make a fair comparison.

The results are presented in Table 2. The first two rows compare training with and without the weighted combination of hypothesis pairs discussed in Section 4.3. As the results suggest, with the weighted combination of hypothesis pairs from previous iterations, the performance improves significantly on both test sets.

Although the system performance on the dev set varies, the performance on the test sets is almost comparable. This suggests that although the three training criteria are based on different assumptions, they are basically equivalent for training translation systems.

We also compare the three training criteria in terms of their number of new instances per iteration and final training accuracy (Table 3). Compared to BR, which tries to separate the best hypothesis from the rest of the hypotheses in the n-best set, and PW, which tries to obtain a correct ranking of all hypotheses, BW only aims at separating the best and worst hypotheses of each iteration, which is an easier task for learning a classifier. It requires the fewest training instances and achieves the best performance in training. Note that the accuracy for each system in Table 3 is the accuracy the system achieves when training stops. These accuracies are not calculated on the same set of instances, and are thus not directly comparable. We use the differences in accuracy as an indicator of the difficulty of the corresponding learning task.

Criteria  Pairs/iteration  Accuracy(%)
BR        19               70.7
BW        1                79.5
PW        100              67.3

Table 3: Comparison of different training criteria in number of new instances per iteration and training accuracy.

For the rest of this paper, we use the BW criterion because it is much simpler than the sampling method of PRO (Hopkins and May, 2011).


6.3 Experiments of Network Structures 

We make several comparisons of the network structures and compare them with a baseline hierarchical phrase-based translation system (HPB) (Table 4).

We first compare neural networks with different numbers of hidden nodes. The systems TLayer_20, TLayer_30 and TLayer_50 are standard two-layer feed-forward neural networks with 20, 30 and 50 hidden layer nodes, respectively.^3 We can see that training a larger network does lead to an improvement in translation quality. However, training a larger network is often time-consuming. We also experimented with neural networks with 100 or more hidden nodes (TLayer_100), but TLayer_100 takes 10 times longer in training time per iteration than TLayer_20 and did not finish by the submission deadline.

2 http://www.alglib.net/

3 TLayer_20 is the same system as BW in Table 2.

Systems    MT03(train)  MT02(dev)  MT04    MT05    TestAverage
HPB        39.25        39.07      38.81   38.01   38.41(-)
TLayer_20  39.55*       39.36*     38.72   37.81   38.27(-0.14)
TLayer_30  39.70+       39.71*     38.89   37.90   38.40(-0.01)
TLayer_50  39.26        38.97      38.72   38.79+  38.76(+0.35)
TDN        39.60+       38.94      38.99*  38.13   38.56(+0.15)
GN         39.73+       39.41+     39.45+  38.51+  38.98(+0.57)

Table 4: BLEU4 in percentage for comparing systems using different network structures (HPB refers to the baseline hierarchical phrase-based system. TLayer, TDN and GN refer to the standard 2-layer network, the Two-Degree Hidden Layer Network and the Grouped Network, respectively. The subscript of TLayer indicates the number of nodes in the hidden layer.) + and * mark results that are significantly better than the baseline system with p < 0.01 and p < 0.05, respectively.

We then compare the two network structures proposed in Section 5. The Two-Degree Hidden Layer Network (TDN) already performs comparably to the baseline system. But it constrains all inputs to a hidden node to be of degree 2, which is likely to be too restrictive. With grouped features, we can design networks such as GN, which shows significant improvement over the baseline system and achieves the best performance among all neural systems. Note that GN is much larger in scale, but is also sparse in parameters and takes significantly less training time than standard neural networks.

7 Conclusion 

In this paper, we discuss a non-linear framework for modeling translation hypotheses in statistical machine translation systems. We also present a learning framework, including training criteria and algorithms, to integrate our modeling into a state-of-the-art hierarchical phrase-based machine translation system. Compared to previous efforts to bring non-linearity into machine translation, our method uses a single two-layer neural network and performs training independently of any previous linear training method (e.g. MERT). Our method also trains its parameters without any pre-training or post-training procedure. Experiments show that our method can improve the baseline system even with the same features as input, in a large-scale Chinese-English machine translation task.

In training neural networks with hidden nodes, we use heuristics to reduce the complexity of network structures and obtain extra advantages over standard networks. This shows that heuristics and intuitions about the data and features are still important to a machine translation system.

As future work, it is necessary to integrate more features into our learning framework. It would also be interesting to see how the non-linear modeling fits into more complex learning tasks that involve domain-specific learning techniques.